An Analysis of Employee Statistics From Glassdoor

Aidan Henbest
Dr. Bixler
Computer Science
15 June 2022

Introduction

$\;\;\;\;\;\;$The data set chosen for this analysis comes from the report "How to Analyze Your Gender Pay Gap: An Employer’s Guide" on www.glassdoor.com. The link to the data set appears on this page: https://www.glassdoor.com/research/how-to-analyze-gender-pay-gap-employers-guide/. It was chosen for its relevance in today's world, in which pay equality is widely debated. The pay inequality most often discussed, the gender pay gap, can be analyzed with this data set, but so can many other questions. The data set describes one thousand employees, including their job title, gender, age, performance evaluation score, education level, department, seniority level, yearly base pay, and bonus pay. The questions answered in this analysis are these six:

1. What impact does gender have on the pay statistics of employees?
2. Does the gender pay gap differ between different jobs and departments?
3. Does the gender pay gap differ between different performance evaluation scores of employees?
4. What impact does seniority have on the pay statistics of employees?
5. Does education level affect the likelihood to become a senior employee?
6. Does seniority affect the performance evaluation score of employees?

Although this data set could certainly be explored further, only these six questions could be analyzed in the time allotted. Each question is answered in its own section later in this analysis.

Initial Data Analysis

$\;\;\;\;\;\;$This function shows the title of each column in the main data frame: job title, gender, age, performance evaluation, education, department, seniority, base pay, bonus, percent bonus, and total pay. It also shows that each column holds one thousand values, and it reports the data type of each column. Job title, gender, education, and department are objects; age, performance evaluation, seniority, base pay, bonus, and total pay are int64; and percent bonus is float64.
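$\;\;\;\;\;\;$The function itself is not named in the text. As a minimal sketch, assuming the data is held in a pandas data frame, `info()` prints this kind of summary. The frame `pay`, its column names, and its four rows are hypothetical stand-ins, not the actual data set.

```python
import pandas as pd

# Hypothetical four-row stand-in for the main data frame; the column
# names and values are assumptions, not the actual Glassdoor data.
pay = pd.DataFrame({
    "JobTitle": ["Driver", "Manager", "IT", "Driver"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [25, 48, 33, 52],
    "BasePay": [52000, 131000, 88000, 61000],
    "Bonus": [5200, 9100, 7000, 4800],
    "PercentBonus": [10.0, 6.9, 8.0, 7.9],
    "TotalPay": [57200, 140100, 95000, 65800],
})

# info() prints the column names, non-null counts, and data types.
pay.info()

# The same facts are also available as objects for further checks.
dtypes = pay.dtypes      # data type of each column
non_null = pay.count()   # number of non-missing values per column
```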

$\;\;\;\;\;\;$This function shows the amount of memory, in bytes, taken up by each column in the main data frame. Job title takes up 70,468 bytes, gender takes up 61,936 bytes, education takes up 64,108 bytes, and department takes up 66,929 bytes. Each of the seven numeric columns (age, performance evaluation, seniority, base pay, bonus, percent bonus, and total pay) takes up 8,000 bytes, which is one thousand values at 8 bytes each.

$\;\;\;\;\;\;$This function shows the amount of memory, in bytes, taken up by each column in the pay_num data frame. This data frame has all of the data converted to numbers, which makes some of the analysis easier. Because every column is now numeric, all eleven columns (job title, gender, age, performance evaluation, education, department, seniority, base pay, bonus, percent bonus, and total pay) take up 8,000 bytes each.

$\;\;\;\;\;\;$This function shows the amount of memory, in bytes, taken up by each column in the pay_my_type data frame. This data frame has the education column converted to a custom data type, which allows the levels of education to be ordered correctly. Education now takes up only 1,428 bytes, compared with 64,108 bytes in the main data frame. The other columns are unchanged: job title takes up 70,468 bytes, gender takes up 61,936 bytes, department takes up 66,929 bytes, and each of the seven numeric columns takes up 8,000 bytes.
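$\;\;\;\;\;\;$The memory comparison and the ordered education type can be sketched as follows. This assumes pandas, with `memory_usage(deep=True)` for the byte counts and an ordered `CategoricalDtype` as the custom type; the text names neither. The column names and the synthetic rows are hypothetical, so the byte counts will not match the ones reported above.

```python
import pandas as pd

# Assumed ordering of the four education levels, lowest to highest.
levels = ["High School", "College", "Masters", "PhD"]

# Synthetic 1,000-row stand-in for the main data frame (not the real data).
pay = pd.DataFrame({
    "Education": ["College", "PhD", "High School", "Masters"] * 250,
    "Age": list(range(20, 60)) * 25,
})

# An ordered categorical stores one small integer code per row
# plus a single copy of each label.
edu_type = pd.CategoricalDtype(categories=levels, ordered=True)
pay_my_type = pay.assign(Education=pay["Education"].astype(edu_type))

# deep=True also counts the bytes used by the stored strings.
mem_main = pay.memory_usage(index=False, deep=True)
mem_typed = pay_my_type.memory_usage(index=False, deep=True)

# With the ordered type, comparisons follow the education ranking
# instead of alphabetical order.
above_college = (pay_my_type["Education"] > "College").sum()
alphabetical = (pay["Education"] > "College").sum()
```

In this sketch the plain string comparison also counts "High School" as greater than "College", because it compares alphabetically, which is the kind of mis-ordering an ordered type avoids.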

$\;\;\;\;\;\;$This function shows the number of values missing from each column of the main data frame. Since the other data frames only convert values from the main one to a different data type, this information applies to all of the data frames. No column is missing any values: job title, gender, age, performance evaluation, education, department, seniority, base pay, bonus, percent bonus, and total pay each have zero missing values.
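$\;\;\;\;\;\;$A missing-value count of this kind can be sketched in pandas with `isnull().sum()`, which is an assumption about the method used. The small frame below is hypothetical and deliberately contains gaps so that nonzero counts are visible; the actual data set has none.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two deliberate gaps (the real data has none).
pay = pd.DataFrame({
    "Gender": ["Female", "Male", None],
    "Age": [25.0, np.nan, 33.0],
    "BasePay": [52000, 131000, 88000],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing = pay.isnull().sum()
print(missing)
```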

$\;\;\;\;\;\;$This function shows the sum of all of the values in each column of the pay_num data frame. From this we can see that job title sums to 4,542; gender sums to 468; age sums to 41,393; performance evaluation sums to 3,037; education sums to 1,467; department sums to 1,950; seniority sums to 2,971; base pay sums to \$94,472,653; bonus sums to \$6,467,161; percent bonus sums to 7,515.73%; and total pay sums to \$100,939,814. The sums of the encoded category columns, such as job title and department, depend on the numbers assigned to each category, so they are mainly useful for checking the means reported later.

$\;\;\;\;\;\;$This function shows the first 10 rows of the main data frame, including all of the values from each column in the data frame.

$\;\;\;\;\;\;$This function shows the last 10 rows of the main data frame, including all of the values from each column in the data frame.
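$\;\;\;\;\;\;$Assuming pandas, the first and last rows of a data frame can be viewed with `head` and `tail`. The single-column frame below is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical 25-row frame standing in for the main data frame.
pay = pd.DataFrame({"Age": range(20, 45)})

first_ten = pay.head(10)  # rows 0 through 9
last_ten = pay.tail(10)   # rows 15 through 24
```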

$\;\;\;\;\;\;$This function performs basic statistical analysis on every column of the pay_num data frame. The statistics include the count, mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile, and maximum. Of these, the most important values are the means. The mean age is 41.39, the mean performance evaluation is 3.04 (out of 5), the mean seniority is 2.97 (out of 5), the mean base pay is \$94,472.65, the mean bonus is \$6,467.16, the mean percent bonus is 7.52%, and the mean total pay is \$100,939.81. The mean gender is 0.47, which means that there is a fairly even number of male and female employees, leaning slightly towards the male side. The mean education is 1.47, which means that most employees have completed at least a college education.
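$\;\;\;\;\;\;$The gender and education figures can be checked against the category counts reported in this analysis. The sketch below assumes pandas and assumes the numeric codes Male = 0, Female = 1 and High School = 0, College = 1, Masters = 2, PhD = 3. These codes are not stated in the text; they are inferred because they reproduce the reported sums of 468 and 1,467. The two columns are built independently, so only the column-wise statistics are meaningful.

```python
import pandas as pd

# Category counts reported in this analysis.
gender_counts = {"Male": 532, "Female": 468}
education_counts = {"High School": 265, "College": 241, "Masters": 256, "PhD": 238}

# Assumed numeric codes (inferred, not stated in the text).
gender_codes = {"Male": 0, "Female": 1}
education_codes = {"High School": 0, "College": 1, "Masters": 2, "PhD": 3}

# Rebuild each encoded column from its counts; row pairing is arbitrary.
gender = [gender_codes[k] for k, n in gender_counts.items() for _ in range(n)]
education = [education_codes[k] for k, n in education_counts.items() for _ in range(n)]
pay_num = pd.DataFrame({"Gender": gender, "Education": education})

totals = pay_num.sum()        # column sums
summary = pay_num.describe()  # count, mean, std, min, quartiles, max
print(summary.loc["mean"].round(2))
```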

$\;\;\;\;\;\;$This function shows the number of values in each category of the job title column of the main data frame. This shows that there are 118 marketing associates, 109 software engineers, 107 financial analysts, 107 data scientists, 98 graphic designers, 96 IT employees, 94 sales associates, 91 drivers, 90 warehouse associates, and 90 managers. This data shows a fairly even distribution of employees throughout the ten job titles.

$\;\;\;\;\;\;$This function shows the number of values in each category of the gender column of the main data frame. This shows that there are 532 males and 468 females. This data shows a fairly even distribution of employees between the two genders.

$\;\;\;\;\;\;$This function shows the number of values in each category of the performance evaluation column of the main data frame. This shows that there are 209 employees with a 5 performance evaluation score, 207 employees with a 4 performance evaluation score, 198 employees with a 1 performance evaluation score, 194 employees with a 3 performance evaluation score, and 192 employees with a 2 performance evaluation score. This data shows a fairly even distribution of employees throughout the five performance evaluation scores.

$\;\;\;\;\;\;$This function shows the number of values in each category of the education column of the main data frame. This shows that there are 265 employees with a high school level education, 256 employees with a masters level education, 241 employees with a college level education, and 238 employees with a PhD level education. This data shows a fairly even distribution of employees throughout the four levels of education.

$\;\;\;\;\;\;$This function shows the number of values in each category of the department column of the main data frame. This shows that there are 210 employees in operations, 207 employees in sales, 198 employees in management, 193 employees in administration, and 192 employees in engineering. This data shows a fairly even distribution of employees throughout the five departments.

$\;\;\;\;\;\;$This function shows the number of values in each category of the seniority column of the main data frame. This shows that there are 219 employees with a seniority level of 3, 209 employees with a seniority level of 2, 195 employees with a seniority level of 1, 193 employees with a seniority level of 5, and 184 employees with a seniority level of 4. This data shows a fairly even distribution of employees throughout the five seniority levels.
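$\;\;\;\;\;\;$The category counts in the preceding paragraphs are listed from most to least frequent, which is how the pandas method `value_counts` orders its result. A minimal sketch, assuming that method and using a short hypothetical seniority column:

```python
import pandas as pd

# Hypothetical seniority levels for eight employees.
seniority = pd.Series([3, 2, 3, 1, 5, 3, 2, 4], name="Seniority")

# Counts per category, sorted from most to least frequent.
counts = seniority.value_counts()

# normalize=True returns each category's share instead of its count.
shares = seniority.value_counts(normalize=True)
```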

$\;\;\;\;\;\;$This function groups the main data frame by the gender and seniority columns and then computes the mean of each numerical column within every subgroup. From this, it can be seen that the mean age at each seniority level, for both males and females, hovers around forty years old. Furthermore, the mean performance evaluation score at each seniority level, for each gender, is around three. In addition, the mean base pay and total pay are substantially higher for males at every seniority level. However, the bonus and percent bonus columns do not follow this pattern: they are about the same for each gender, if not slightly higher for females.
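$\;\;\;\;\;\;$A grouped mean of this kind can be sketched with pandas `groupby`, which is an assumption about the method used. The column names and the six rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; names and values are assumptions, not the real data.
pay = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "Seniority": [1, 1, 1, 1, 2, 2],
    "JobTitle": ["IT", "Driver", "IT", "Driver", "Manager", "Manager"],
    "BasePay": [60000, 64000, 70000, 74000, 80000, 90000],
    "Bonus": [6000, 7000, 6000, 6500, 7500, 7000],
})

# One row per (gender, seniority) subgroup; numeric_only=True skips
# text columns such as JobTitle, which have no mean.
group_means = pay.groupby(["Gender", "Seniority"]).mean(numeric_only=True)
print(group_means)
```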

$\;\;\;\;\;\;$This function performs a cross-tabulation of the gender and seniority columns of the main data frame. This shows that the number of employees in each gender and seniority level subgroup is about the same, as all of the values are around 0.1, or 10%.
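$\;\;\;\;\;\;$Values near 0.1 in each of the ten cells indicate that the table holds shares of all employees, not raw counts. A minimal sketch, assuming pandas `crosstab` with `normalize=True` and a hypothetical eight-row frame:

```python
import pandas as pd

# Hypothetical rows with two seniority levels instead of five.
pay = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Male", "Male", "Female"],
    "Seniority": [1, 2, 1, 2, 1, 1, 2, 2],
})

# normalize=True divides every cell by the grand total,
# so the whole table sums to 1.
shares = pd.crosstab(pay["Gender"], pay["Seniority"], normalize=True)
print(shares)
```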

$\;\;\;\;\;\;$This function creates a correlation graph of the numerical data from the pay_num data frame. The values can range from negative one to one, and the higher the absolute value of a box in this graph, the stronger the correlation between those two columns. Based on this, it can be seen that age has a strong correlation with base pay, bonus, percent bonus, and total pay; performance evaluation has a strong correlation with bonus and percent bonus; seniority has a strong correlation with base pay and total pay; base pay has a strong correlation with percent bonus and total pay; bonus has a strong correlation with percent bonus; and percent bonus has a strong correlation with total pay.
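$\;\;\;\;\;\;$The numbers behind such a graph can be sketched with the pandas `corr` method, which computes Pearson correlations by default. The plotting step is omitted because the text does not name the plotting library, and the five-row frame is hypothetical.

```python
import pandas as pd

# Hypothetical numeric frame; base pay rises in a straight line with
# seniority, while performance scores are scattered.
pay_num = pd.DataFrame({
    "Seniority": [1, 2, 3, 4, 5],
    "BasePay": [60000, 70000, 80000, 90000, 100000],
    "PerfEval": [3, 5, 1, 4, 2],
})

# Square matrix of pairwise correlations, each between -1 and 1.
corr = pay_num.corr()
print(corr.round(2))
```

Here seniority and base pay correlate perfectly, while seniority and performance evaluation correlate only weakly and negatively.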

Question 1: What impact does gender have on the pay statistics of employees?

$\;\;\;\;\;\;$Many different types of graphs were created to analyze how a variety of pay statistics differ by gender. The pay statistics analyzed are base pay, bonus, total pay, and percent bonus. The types of graphs created include bar graphs, box plots, strip plots, violin plots, and swarm plots. From these graphs, it can be seen that the pay statistics for both males and females span a wide range of values. Despite this, there are few outliers in any column other than percent bonus, which has many more outliers than the other three columns. In addition, it can be seen that the base pay and total pay of male employees are higher than those of female employees. While this result was expected, it was still worth confirming. More surprisingly, the bonuses that employees received did not differ much between genders, and females received slightly higher bonuses. Since the bonuses of females are generally slightly higher than those of males while their base pay is generally quite a bit lower, their percent bonuses are substantially higher. This is surprising, as both the bonus values and the percent bonus values for females were expected to be lower than those of males. In conclusion, while males have substantially higher base and total pay, females surprisingly have higher bonuses and percent bonuses than males.

Question 2: Does the gender pay gap differ between different jobs and departments?

$\;\;\;\;\;\;$Many different types of graphs were created to analyze how a variety of pay statistics differ by department, job title, and gender. The pay statistics analyzed are base pay, bonus, total pay, and percent bonus. The types of graphs created include bar graphs, box plots, strip plots, violin plots, and swarm plots. From these graphs, it can be seen that the gender pay gap varies slightly between departments and immensely between jobs. Within each department, the general trends follow those of the data as a whole. The only exceptions are the management, administration, and engineering departments, in which men have a higher mean bonus than women. In contrast, each job seems to follow its own rules. For instance, the base pay and total pay of graphic designers, warehouse associates, financial analysts, data scientists, and managers are higher for females than for males. This is extremely surprising, as it is the exact opposite of the trend followed by the rest of the data. In addition, male graphic designers, software engineers, drivers, financial analysts, marketing associates, and managers have a higher bonus on average, which contrasts with the overall trend in which females have a slightly higher bonus. That overall trend may be the result of one job, warehouse associate, in which females have a substantially higher mean bonus than males; this one job may skew the rest of the data in that direction. Lastly, male graphic designers, financial analysts, and managers also have a higher percent bonus than females in the same jobs, which again does not follow the trend of the data as a whole. In conclusion, the gender pay gap varies slightly between departments and, more surprisingly, varies greatly between jobs.

Question 3: Does the gender pay gap differ between different performance evaluation scores of employees?

$\;\;\;\;\;\;$Many different types of graphs were created to analyze how a variety of pay statistics differ by performance evaluation score and gender. The pay statistics analyzed are base pay, bonus, total pay, and percent bonus. The types of graphs created include line plots, scatter plots, box plots, strip plots, violin plots, and swarm plots. From these graphs, it can be seen that performance evaluation scores play a large role in the bonus and percent bonus an employee receives but very little role in base pay and total pay. There is a clear positive trend when performance score is compared to bonus and to percent bonus. In contrast, there is no trend between performance score and base pay or between performance score and total pay, since these graphs are close to horizontal. This is unexpected; a positive trend was predicted for all four pay statistics. Furthermore, these graphs show that the gender pay gap does not differ between performance scores. No matter the performance score, the data follow the general trends found earlier: males have higher base pay and total pay, while females have higher bonuses and percent bonuses. This is as expected. In conclusion, performance evaluation scores affect bonuses and percent bonuses but not base pay and total pay, and the gender pay gap follows the same trend at every performance score.

Question 4: What impact does seniority have on the pay statistics of employees?

$\;\;\;\;\;\;$Many different types of graphs were created to analyze the impact that seniority has on the pay statistics of employees. The pay statistics analyzed are base pay, bonus, total pay, and percent bonus. The types of graphs created include line plots, scatter plots, box plots, strip plots, violin plots, and swarm plots. Seniority has a positive correlation with base pay, bonus, and total pay, while it has a negative correlation with percent bonus. The negative correlation arises because base pay and total pay rise steeply with seniority while bonus rises only gently, so the bonus shrinks relative to pay. While the positive correlations between seniority and base pay, bonus, and total pay were entirely expected, the negative correlation between seniority and percent bonus was not. In conclusion, as employees become more senior, their base pay, bonus, and total pay increase; however, their percent bonus decreases.
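$\;\;\;\;\;\;$A short worked example shows how a rising bonus can still produce a falling percent bonus. It assumes that percent bonus is the bonus divided by base pay, which the text implies but does not state, and the pay figures are hypothetical.

```python
# Hypothetical mean pay at three seniority levels: base pay climbs
# steeply with seniority while the bonus climbs only gently.
base_pay = {1: 70000, 3: 95000, 5: 120000}
bonus = {1: 6000, 3: 6500, 5: 7000}

# Assumed definition: percent bonus = 100 * bonus / base pay.
percent_bonus = {
    level: round(100 * bonus[level] / base_pay[level], 2)
    for level in base_pay
}
print(percent_bonus)  # {1: 8.57, 3: 6.84, 5: 5.83}
```

Both pay columns increase with seniority, yet the percent bonus falls at every step because the denominator grows faster than the numerator.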

Question 5: Does education level affect the likelihood to become a senior employee?

$\;\;\;\;\;\;$Many different types of graphs were created to analyze the impact that education level has on the likelihood that an employee becomes a senior employee. The types of graphs created include line plots, scatter plots, box plots, strip plots, violin plots, and swarm plots. All of these graphs show a nearly horizontal trend. This means that the education level of an employee does not affect their likelihood of becoming a senior employee. This is completely unexpected, as the company was expected to try its best to retain those with higher education, which would give those employees a higher likelihood of becoming senior employees. In conclusion, these surprising results suggest that education level is not important when it comes to this company's attempts at retaining its employees.

Question 6: Does seniority affect the performance evaluation score of employees?

$\;\;\;\;\;\;$Many different types of graphs were created to analyze the impact that seniority has on the performance evaluation score that an employee obtains. The types of graphs created include line plots, scatter plots, box plots, strip plots, violin plots, and swarm plots. All of these graphs show a nearly horizontal trend. This means that seniority does not affect the performance evaluation score that an employee obtains. Therefore, there does not appear to be bias by the performance evaluator toward senior employees; they can receive any score, just like the rest of the employees. This is expected, as the likelihood of substantial bias was minimal, but it was still worth testing. The horizontal trend also shows that those with higher performance evaluations do not have a higher likelihood of becoming senior employees. This company does not appear to prioritize those with high performance evaluation scores over those with lower ones, as its employee retention is about the same for all performance scores. This is unexpected; it was thought that the company would have tried harder to retain employees with higher performance scores, but it appears that the company tries to retain all employees equally. In conclusion, this company does not have a bias towards senior employees, nor does it try harder to retain employees with high performance evaluation scores over those with low performance evaluation scores.

Summary

$\;\;\;\;\;\;$To conclude, many correlations have been found in this data set of one thousand employees from Glassdoor. This data set has shown that while males have higher base pay and total pay, females have higher bonuses and percent bonuses. However, these statistics vary greatly across jobs and slightly across departments. Furthermore, while those with higher performance scores receive higher bonuses and percent bonuses, they do not receive higher base pay and total pay. Also, the gender pay gap does not appear to differ between performance scores, and there does not appear to be a bias against any gender when determining performance scores. Next, this data set showed that while seniority influences base pay, bonus, and total pay positively, it influences percent bonus negatively. Lastly, this data set showed that the education level and the performance evaluation score of an employee do not affect their likelihood of becoming a senior employee. While this analysis has attempted to be as complete as possible, it did not explore many of the other values included in the data set. For example, the age column could have shown some interesting correlations with many other columns, but it was not explored at all. If this project were done again, the age column would definitely be explored further. Be that as it may, this project was challenging as is, and creating some of the graphs and subplots was very difficult, so there was not enough time in this instance to explore the age column thoroughly.